Common Crawl
https://commoncrawl.org/
Used for training #LLM ( #ChatGPT )
Data is available from https://data.commoncrawl.org/
https://ja.wikipedia.org/wiki/コモン%E3%83%BBクロール
https://commoncrawl.github.io/cc-crawl-statistics/
Languages https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
The language of a document is identified by Compact Language Detector 2 (CLD2).
👉cld2
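A minimal sketch of getting the data from https://data.commoncrawl.org/ with Python. It assumes each crawl publishes a gzipped path listing at `crawl-data/<crawl id>/warc.paths.gz`, with one relative file path per line. The crawl ID `CC-MAIN-2024-10` is only an example value, and the function names are hypothetical helpers, not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id: str, kind: str = "warc") -> str:
    """Build the URL of a crawl's gzipped path listing (assumed layout)."""
    return f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"


def parse_paths_listing(gz_bytes: bytes) -> list[str]:
    """Decompress a path listing and turn each relative path into a full URL."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [BASE_URL + line.strip() for line in text.splitlines() if line.strip()]


def fetch_file_urls(crawl_id: str, kind: str = "warc") -> list[str]:
    """Download the listing for one crawl and return the file URLs (needs network)."""
    with urllib.request.urlopen(paths_listing_url(crawl_id, kind)) as resp:
        return parse_paths_listing(resp.read())


if __name__ == "__main__":
    # Example crawl ID; replace it with the crawl you actually want.
    urls = fetch_file_urls("CC-MAIN-2024-10")
    print(len(urls), "files; first:", urls[0])
```

A single crawl contains a very large number of files, so it is usually better to download a few entries from the listing than to fetch everything.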